By Julien Hernandez Lallement, 2020-11-05, in category Course
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns; sns.set_style('darkgrid')
import statsmodels.api as sm
def illustrate_svm(a=1.3, b=0.5, n=30, r=1, make_prediction=False, other_lines=False, main_split=False, margins=False):
    """
    params:
    a: slope
    b: intercept
    n: sample size
    r: error
    make_prediction: boolean. Does not make a real prediction, simply adds a value for illustrative purposes
    main_split: plots one candidate separating line (hyperplane)
    other_lines: plots alternative separating lines, similar to the main one
    margins: plots the margins around the main separating line, for illustrative purposes
    """
    np.random.seed(42)

    def f(x, r_low=0, r_high=1):
        return a * x + b + r * np.random.uniform(r_low, r_high, len(x))

    x = np.random.rand(n)
    y = f(x)

    fig, ax = plt.subplots()
    ax.set_xlim([-0.05, 1.15])

    # plot the data points
    plt.scatter(x, y)

    # plot one candidate separating line
    if main_split:
        sns.regplot(x=[0.7, 0.5], y=[1, 2.25], ci=None, scatter=False)

    # plot the margins around the main separating line
    if margins:
        sns.regplot(x=[0.74, 0.54], y=[1, 2.25], ci=None, scatter=False, color='black')
        sns.regplot(x=[0.66, 0.46], y=[1, 2.25], ci=None, scatter=False, color='black')

    # plot other candidate lines, similar to the main one
    if other_lines:
        sns.regplot(x=[0.8, 0.45], y=[1, 2.25], ci=None, scatter=False)
        sns.regplot(x=[0.65, 0.55], y=[1, 2.25], ci=None, scatter=False)

    # add an extra point outside the observed range, for illustrative purposes
    if make_prediction:
        x_ = np.array(max(x) + 0.1).reshape(1, 1)
        plt.plot(x_, f(x_), 'ro')

    plt.xlabel('x', fontsize=15)
    plt.ylabel('y', fontsize=15)
    plt.show()
Support Vector Machines (SVM) are supervised learning methods used for classification. In a nutshell, this approach divides the data with respect to a hyperplane (a flat subspace of the multi-dimensional feature space).
The main idea behind SVM can be illustrated as follows. Say you have two numerical features that you can plot in a two-dimensional space:
illustrate_svm()
One way to separate two groups of data points would be to draw a line between the different categories. Such a line can be called a hyperplane. A hyperplane is a subspace of dimension N-1, N being the number of dimensions of the space in which it is located: in a two-dimensional space a hyperplane is a line, in a three-dimensional space it is a plane.
Let's draw one possible hyperplane
illustrate_svm(main_split=True)
The question is, which line would do the best job at separating the two groups of data points?
illustrate_svm(main_split=True, other_lines=True)
One way to approach this problem is to define the best line as the one that has the largest margins between itself and the data points.
illustrate_svm(main_split=True, margins=True)
The goal of an SVM approach is to find the margins that are the furthest apart. You might wonder what the Support Vector part in SVM means. The support vectors are the data points from each category that are closest to the hyperplane. They support the hyperplane: if they were to change, the placement of the hyperplane would change as well. The Machine part in SVM? Not sure, it seems it simply means that this algorithm allows machines to learn...
The margins are depicted in black in the plot above. This concept is called the max-margin hyperplane, which is the mid-line of the largest gap we can find between the data points. Using a strict max-margin hyperplane is relatively rare with real-world data and has a few downsides, one of the most important being its sensitivity to outliers (if one of the support points is an outlier of its category, then the algorithm would be overfitting the data).
Typically, users will find a balance between a max-margin hyperplane and too many violations of that hyperplane, that is, data points that fall on the other (wrong) side of the hyperplane. That balance is controlled by a parameter typically referred to as C, which can be nicely tuned via cross-validation.
This section is taken from Tyler Folkman's slides: "A smaller value of C leads to more margin violations but a larger distance between the hyperplane and the support vectors. Inversely, a higher value has less margin violations, but a smaller distance between the hyperplane and support vectors. Reducing C leads to more bias, and increasing it more variance."
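To make this trade-off concrete, here is a minimal sketch (my own toy example, not part of the original notebook) that fits a linear SVM with a low and a high value of C and reports the resulting margin width (for a linear SVM the geometric margin is $2 / \|w\|$) and the number of support vectors:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC
import numpy as np

# Toy data: two roughly separable blobs (illustrative values only)
X_toy, y_toy = make_blobs(n_samples=60, centers=2, cluster_std=1.2, random_state=42)

for C in (0.01, 100):
    clf = SVC(kernel='linear', C=C).fit(X_toy, y_toy)
    margin = 2 / np.linalg.norm(clf.coef_)    # geometric margin of a linear SVM
    n_sv = clf.support_vectors_.shape[0]      # points that define the hyperplane
    print(f"C={C}: margin width = {margin:.2f}, support vectors = {n_sv}")

With the small C, you should see a wider margin and more support vectors (more margin violations tolerated); with the large C, a narrower margin and fewer support vectors.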
It turns out that to solve the SVM model, the only computations required are inner products of the observations. The inner product here for two rows of data is just the sum of the features multiplied by each other. If we have P features:
$$\langle x_{i},x_{j}\rangle = \sum_{k=1}^{P}x_{ik}x_{jk}$$

The kernel trick basically says that every time we see an inner product when solving the SVM, we replace it with a more general form. The form shown above is known as the linear form. But we also have a polynomial form:
$$K(x_{i},x_{j}) = \left(1 + \sum_{k=1}^{P}x_{ik}x_{jk}\right)^d$$

In the polynomial form, we have a $d$ parameter, which is the degree of the polynomial we would like to use. It turns out that this is a much more efficient way of adding polynomial features to an SVM.
Another popular form is the radial form:
$$K(x_{i},x_{j}) = \exp\left(-\gamma \sum_{k=1}^{P}(x_{ik} - x_{jk})^2\right)$$

In this form, we have the $\gamma$ parameter, which is > 0. If the two observations being compared are very far apart, we get the exponential of a large negative number, which is a very small number. This means that observations far away have a small effect, so this kernel has very local behavior. This is an especially cool kernel because the enlarged feature space is infinite-dimensional, so without the kernel trick, we could never do this!
Note: Choosing a large value for $\gamma$ increases variance and a small value increases bias. You can also try to find the best kernel and parameter values using cross-validation.
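To make these formulas concrete, here is a minimal NumPy sketch of the three kernel forms applied to two made-up observations (my own illustration, not part of the original notebook):

import numpy as np

def linear_kernel(x_i, x_j):
    # <x_i, x_j> = sum_k x_ik * x_jk
    return np.dot(x_i, x_j)

def polynomial_kernel(x_i, x_j, d=2):
    # (1 + <x_i, x_j>)^d
    return (1 + np.dot(x_i, x_j)) ** d

def radial_kernel(x_i, x_j, gamma=0.1):
    # exp(-gamma * sum_k (x_ik - x_jk)^2)
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))

# Two made-up observations with P = 3 features
x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([0.5, 1.5, 2.5])

print(linear_kernel(x_i, x_j))       # 11.0
print(polynomial_kernel(x_i, x_j))   # (1 + 11)^2 = 144.0
print(radial_kernel(x_i, x_j))       # exp(-0.1 * 0.75) ≈ 0.93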
I will use the famous Iris dataset, which you can get directly from Sklearn:
from sklearn.datasets import load_iris
df = load_iris(as_frame=True)
In this dataset, you get three different categories, setosa, versicolor and virginica, with 50 observations each.
# Name dict for legend mapping
target_name = {0: 'setosa',
1: 'versicolor',
2: 'virginica'}
# Plot data
sns.scatterplot(x=df.frame['petal length (cm)'], y=df.frame['sepal width (cm)'], hue=df.frame.target.map(target_name));
Sklearn can handle more than 2 classes with a one-vs-one method. It basically takes all the different pairs of labels and trains a separate classifier for each one. All models are then run for prediction, each casting a vote, and the class with the most votes wins. Since this is quite computationally demanding, and since I am doing this for demonstration, I will focus on two dimensions.
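As a side note, here is a minimal sketch of what this one-vs-one scheme amounts to (my own illustration; SVC already does this internally), using Sklearn's explicit OneVsOneClassifier wrapper on the full feature set just to count the pairwise models:

from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# One binary SVC per pair of labels: with 3 classes, 3 * (3 - 1) / 2 = 3 classifiers
ovo = OneVsOneClassifier(SVC(kernel='linear', C=1))
ovo.fit(df.data, df.target)
print(len(ovo.estimators_))  # -> 3 pairwise classifiers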
So I take the first two columns:
df.data.iloc[:, 0:2]
Let's first test the model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.data.iloc[:, :2], df.target)
from sklearn.svm import SVC
model = SVC(kernel='linear', C=100)
Note that you can use different types of kernels. In this case, I will be using a basic linear SVC, but I recommend you look into other kernels. In particular, the polynomial kernel (used in conjunction with a degree parameter) can be quite useful in some cases.
Note as well the regularization parameter C, here equal to 100.
As discussed above, this C parameter influences the SVM optimization by controlling how much misclassification is tolerated for each training example. In other words, when C is high, the algorithm will choose a smaller-margin hyperplane if that gets more training points classified correctly. Conversely, when C has a low value, the optimizer will look for a larger-margin separating hyperplane, accepting more misclassifications.
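As a quick sanity check (my own addition, reusing the SVC import and the training split defined above), you can compare how many support vectors the model keeps per class for a low and a high value of C:

# Fewer margin violations (high C) should translate into fewer support vectors
for C in (0.1, 100):
    m = SVC(kernel='linear', C=C).fit(X_train, y_train)
    print(f"C={C}: support vectors per class = {m.n_support_}")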
model.fit(X_train, y_train)
print('Accuracy on training set: {:.2f}'.format(model.score(X_train, y_train)))
print('Accuracy on test set: {:.2f}'.format(model.score(X_test, y_test)))
We can tune the hyperparameter C to find the optimal value, using GridSearch:
from sklearn.model_selection import GridSearchCV
c = np.linspace(start=0.1, stop=300, num=50)  # C must be strictly positive
param_grid = {'C': c}
grid = GridSearchCV(model, param_grid =param_grid, cv=3, n_jobs=-1, scoring='accuracy')
grid.fit(X_train, y_train)
print("The best parameters C value is %s with a score of %0.0f" % (grid.best_params_, grid.best_score_ * 100 ))
print( "The best estimator accuracy on test set = {:.2f} ".format(grid.best_estimator_.score(X_test, y_test) * 100 ) )
Since we used a visualization at the beginning of the notebook, let's plot the decision boundaries of the classes that the SVM separated.
As a reminder, we are looking at these two dimensions:
# plot the samples
plt.scatter(X_train.iloc[:,0], X_train.iloc[:,1], c=y_train, cmap=plt.cm.Paired, edgecolors='k');
The code below is inspired by the Sklearn repo:
# Create meshgrid to plot
x_min, x_max = X_train.iloc[:,0].min() - 1, X_train.iloc[:,0].max() + 1
y_min, y_max = X_train.iloc[:,1].min() - 1, X_train.iloc[:,1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
# predict the class for each point of the meshgrid
Z = grid.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# plot the samples
plt.scatter(X_train.iloc[:,0], X_train.iloc[:,1], c=y_train, cmap=plt.cm.Paired, edgecolors='k')
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.5);
As mentioned earlier, the kernel used was a linear kernel, which can be observed in the decision boundaries above.
Let's quickly re-run the code above, this time using a polynomial kernel of degree 2:
model = SVC(kernel='poly', degree=2)
model.fit(X_train, y_train)
# Tune Hyperparameter
c = np.linspace(start=0.1, stop=300, num=50)  # C must be strictly positive
param_grid = {'C': c}
grid = GridSearchCV(model, param_grid =param_grid, cv=3, n_jobs=-1, scoring='accuracy')
grid.fit(X_train, y_train)
print("The best parameters C value is %s with a score of %0.0f" % (grid.best_params_, grid.best_score_ * 100 ))
print( "The best estimator accuracy on test set = {:.2f} ".format(grid.best_estimator_.score(X_test, y_test) * 100 ) )
# Create meshgrid to plot
x_min, x_max = X_train.iloc[:,0].min() - 1, X_train.iloc[:,0].max() + 1
y_min, y_max = X_train.iloc[:,1].min() - 1, X_train.iloc[:,1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
# predict the class for each point of the meshgrid
Z = grid.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# plot the samples
plt.scatter(X_train.iloc[:,0], X_train.iloc[:,1], c=y_train, cmap=plt.cm.Paired, edgecolors='k')
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.5);
As often with classification algorithms, the extrapolation made for each class can be quite bothersome. If you look at the plot above, it seems quite odd that the algorithm predicts a class for the whole upper-left part of the graph, where no data points are present. This is just a side note on such algorithms; alternatives such as Gaussian mixture models can be quite useful here, since they allow you to assess uncertainty and to set a confidence threshold below which no prediction is made (up to you ;) )
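For illustration, here is a minimal sketch of that idea (my own addition, with an arbitrary threshold; note that the mixture components are fitted without labels, so they would still need to be mapped back to the actual classes):

from sklearn.mixture import GaussianMixture
import numpy as np

# Fit one Gaussian component per class on the two training features
gmm = GaussianMixture(n_components=3, random_state=42).fit(X_train)

# Posterior probability of each component for the test points
proba = gmm.predict_proba(X_test)

# Abstain (-1) when the model is not confident enough about any component
threshold = 0.8  # arbitrary cut-off, to be tuned
labels = np.where(proba.max(axis=1) >= threshold, proba.argmax(axis=1), -1)
print("Fraction of points left unclassified:", (labels == -1).mean())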
That was it for Support Vector Machines. Hope you enjoyed it, and drop me a mail if you have any comments, questions or (constructive) criticism.